1. Introduction

Consider a sample $Z_1, \ldots, Z_N$ of independent random variables in some space
$\mathcal{Z}$, whose distribution depends on an unknown parameter $f_0$. To estimate $f_0$,
we split the sample into two parts: a test set $Z_1, \ldots, Z_n$ and a training set
$Z_{n+1}, \ldots, Z_N$. Based on the training set various estimators of $f_0$ are constructed,
say $\hat f_1, \ldots, \hat f_p$. To decide among these estimators, we use the test set. Suppose
that $\gamma_f : \mathcal{Z} \to \mathbf{R}$ is a loss function. The final estimate $\hat f$ is now chosen to
minimize the loss $\sum_{i=1}^n \gamma_{\hat f_j}(Z_i)$:

$$\hat f := \arg\min_{\hat f_j,\ 1 \le j \le p} \sum_{i=1}^n \gamma_{\hat f_j}(Z_i) .$$

In this note, we examine whether this procedure leads to taking, among the 
p estimators, the "nearly best" one. Here, "nearly best" will be defined in terms 
of the excess risk of the estimators. 
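
The selection step described above can be illustrated by a minimal sketch (an illustration, not code from the paper): given candidates built on the training part and a loss, pick the candidate with the smallest average loss on the test part. The function name `select_by_test_loss`, the constant predictors and the squared loss are hypothetical choices made only for this example.

```python
import numpy as np

def select_by_test_loss(candidates, loss, z_test):
    """Return (index of the empirical risk minimizer, vector of empirical risks).

    candidates : list of estimators f_1, ..., f_p (built on the training set)
    loss       : function (f, z) -> real number, playing the role of gamma_f(z)
    z_test     : sequence of test observations Z_1, ..., Z_n
    """
    risks = np.array([np.mean([loss(f, z) for z in z_test]) for f in candidates])
    return int(np.argmin(risks)), risks

# Hypothetical example: three constant predictors, squared loss, z = (x, y).
candidates = [lambda x: 0.0, lambda x: 1.0, lambda x: 2.0]
squared_loss = lambda f, z: (z[1] - f(z[0])) ** 2
z_test = [(0.0, 0.9), (1.0, 1.2), (2.0, 1.1)]

best, risks = select_by_test_loss(candidates, squared_loss, z_test)
print(best, risks)
```

Here the test responses lie close to 1, so the second candidate has the smallest empirical risk and is the one selected.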

The behavior of the excess risk near the true value $f_0$ will be called the
margin behavior. We not only consider the classical case, which is quadratic 
margin behavior, but also more general margin behavior. For the tails of our 
excess loss functions, we consider both an exponential moment condition and a 
more general power tail condition. We prove a risk inequality under the most 
general combination of these conditions, and in doing so automatically obtain
risk inequalities for more restricted situations. These latter situations represent
examples we give from regression, classification and density estimation. 

Note that the aggregation we perform is model selection aggregation. There 
is a rich body of literature on which aggregation method is best under a vari- 
ety of conditions. Least-squares regression is discussed by H), which gives the 
optimal rates of a number of methods, including linear and convex aggrega- 
tion. A more general regression problem is addressed by (@). However, most of 
the literature deals with only one particular problem, such as regression, and 
also places strong conditions, like boundedness, on the functions and random 
variables involved. We obtain inequalities for a general loss function setup, and 
without boundedness conditions, at least when conditioning on the training set. 
Such conditioning on the training set is common practice; to average the results 
over the training data then requires more stringent conditions. 

Another fairly general approach is found in [1], which looks at the general
prediction problem, i.e. regression and classification, and uses a progressive
mixture rule for aggregation, but with only a brief reference to averaging over the
training stage, which would be part of the full sample splitting problem. On
the other hand, [11] looks at sample splitting schemes with multiple splits and
thus comes close to crossvalidation, but does so only for the problem of density
estimation. A direct treatment of a crossvalidation scheme is to be found in
[14]. In the context of classification, recent inequalities are given for recursive
aggregation by mirror descent in (0) and for aggregation with exponential
weights by [10].



1.1. Notation 

The results will be conditional on the training set. We use P to denote the 
distribution of the test sample, and E denotes expectation of random variables 
depending on the test sample. 
For $\gamma : \mathcal{Z} \to \mathbf{R}$, we write

$$P\gamma := \frac{1}{n}\sum_{i=1}^n E\gamma(Z_i) , \qquad P_n\gamma := \frac{1}{n}\sum_{i=1}^n \gamma(Z_i) .$$

Let $\gamma_j : \mathcal{Z} \to \mathbf{R}$, $j = 1, \ldots, p$ be given loss functions in a class $\Gamma$. Given the
training set, $\gamma_j$ may be taken as short-hand (and slight abuse of) notation for
$\gamma_{\hat f_j}$, $j = 1, \ldots, p$. We consider the estimator

$$\hat\gamma := \arg\min_{1 \le j \le p} P_n\gamma_j .$$

The target is

$$\gamma_0 := \arg\min_{\gamma \in \Gamma} P\gamma .$$

The best approximation is

$$\gamma_* := \arg\min_{1 \le j \le p} P\gamma_j .$$



We define the excess risks

$$\hat{\mathcal{E}} := P(\hat\gamma - \gamma_0)$$

(which is a random variable),

$$\mathcal{E}_j := P(\gamma_j - \gamma_0)$$

and

$$\mathcal{E}_* := P(\gamma_* - \gamma_0) .$$

Without loss of generality, we assume that $\Gamma$ is of the form $\Gamma := \{\gamma_f : f \in F\}$,
where $F$ is a subset of a metric space with metric $d$, and write (with some abuse
of notation) $\gamma_{f_j}$ as $\gamma_j$, $\{f_j\}_{j=1}^p \subset F$.

1.2. Goal 

Our goal is now to show that $\hat{\mathcal{E}}/\mathcal{E}_*$ is close to 1 (with large probability or in
expectation). The results are modifications of inequalities of the form

$$(1 - \delta) E\hat{\mathcal{E}} \le (1 + \delta)\mathcal{E}_* + \frac{\Delta_0}{\delta} ,$$

where $\delta > 0$ is an arbitrary small constant, and with $\Delta_0$ of order $\log(2p)/n$ and
not depending on $\mathcal{E}_*$, see for example Chapter 7 in (jy). In the standard setup
of Section 4 we for instance show that for $1 \le m \le 1 + \log p$



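with $\Delta_1$ and $\Delta_2$ both of order $\log(2p)/n$, and both not depending on $\mathcal{E}_*$. In
particular, with $m = 2$, this reads

$$E\hat{\mathcal{E}} \le \left(\sqrt{\mathcal{E}_*} + \sqrt{\Delta_1}\right)^2 \vee \Delta_2 .$$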



A sharp oracle inequality would be

$$E\hat{\mathcal{E}} \le \mathcal{E}_* + \Delta .$$

We conjecture that such sharpness cannot be established in a general setup by 
empirical risk minimization. Instead, e.g. mirror averaging could be used, see 
(@). See also (Q) and Qj for some limitations of empirical risk minimization, 
and alternative approaches to overcome the limitations. We however believe 
empirical risk minimization remains an important topic of study because it is 
widely applied in practice, and is closely related to various cross validation 
schemes. 
1.3. Convex loss

In our proofs, we only use the property

$$P_n\hat\gamma \le P_n\gamma_* .$$

In the convex case, this sometimes means that conditions can be weakened. Let $F$
be a convex subset of a linear vector space, and suppose that $\Gamma := \{\gamma_f : f \in F\}$,
with $f \mapsto \gamma_f$ convex, $P$-almost everywhere. Then for $0 < \alpha \le 1$, convexity gives
$P_n\gamma_{\alpha\hat f + (1-\alpha) f_*} \le \alpha P_n\gamma_{\hat f} + (1-\alpha) P_n\gamma_{f_*}$, and hence we have the
inequality

$$P_n\gamma_{\alpha\hat f + (1-\alpha) f_*} \le P_n\gamma_{f_*} .$$

This means that we can replace $\hat\gamma$ by $\gamma_{\alpha\hat f + (1-\alpha) f_*}$ throughout, leading to
inequalities for the excess risk

$$\hat{\mathcal{E}}_\alpha := P\gamma_{\alpha\hat f + (1-\alpha) f_*} - P\gamma_0 .$$

From these, one may then often deduce inequalities for the original $d(\hat f, f_0)$. As
we shall see, this extension (with $\alpha < 1$) allows us to work with weaker conditions
(than with $\alpha = 1$). In particular, the example on maximum likelihood will use
this approach with $\alpha$ set to 1/2.

1.4. Organization of the paper

The paper is organized as follows. Section 2 presents Bernstein's inequality. It is
stated in the form of a probability inequality and a moment inequality. Section
3 presents the margin condition and some examples. In Section 4 we consider
the standard setup with quadratic margin, and bounded loss or an exponential
moment condition. Section 5 looks at loss with power moment conditions, and
Section 6 at general margin behavior under the exponential moment condition,
giving risk tail bounds. Section 7 formulates the general risk moment inequality,
from which the previous specific results follow. Finally, the proofs are in Section
8.
2. Bernstein's inequality

Bernstein's inequality for a single average is well known, and the extension of
Bernstein's probability inequality to a uniform probability inequality over $p$
averages is completely straightforward. The result can be seen as the simplest
version of a concentration inequality in the spirit of e.g. (0) (emphasizing how
tight these general concentration inequalities are). The moment inequality for
the maximum of $p$ averages is perhaps less known.
For all $j$, we let

$$\gamma_j^c(\cdot) := \gamma_j(\cdot) - E\gamma_j$$

denote the centered loss functions. To obtain our results, we make assumptions
on the tails of the centered excess losses $\gamma_j^c - \gamma_*^c$ or of their envelope
$\Gamma := \max_{1 \le j \le p} |\gamma_j^c - \gamma_*^c|$ as follows:

Definition 2.1. We say that the excess losses $\gamma_j - \gamma_*$ satisfy the exponential
moment condition for some $K > 0$ if

$$P|\gamma_j^c - \gamma_*^c|^m \le \frac{m!}{2}(2K)^{m-2} d^2(f_j, f_*) \quad (1)$$

for all $m = 2, 3, \ldots$ and for all $j = 1, \ldots, p$.

We say that the envelope function $\Gamma$ has power tails of order $s > 1$ if there
exists an $M \in (0, \infty)$ such that

$$P(\{\Gamma > K\}) \le \left(\frac{M}{K}\right)^s \quad \forall K > 0 . \quad (2)$$

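Lemma 2.1. (Bernstein's inequality for the maximum of p averages) Let loss
functions $\gamma_j : \mathcal{Z} \to \mathbf{R}$, $j = 1, \ldots, p$, be given. Assume that for some constant $K$
and for all $j$,

$$P|\gamma_j^c|^m \le \frac{m!}{2}(2K)^{m-2} , \quad m = 2, 3, \ldots .$$

Then for all $t > 0$,

$$P\left( \max_{1 \le j \le p} |P_n\gamma_j^c| \ge \sqrt{\frac{2t}{n} + \frac{2\log(2p)}{n}} + K\left(\frac{2t}{n} + \frac{2\log(2p)}{n}\right) \right) \le \exp[-t] . \quad (3)$$

Moreover, for all $1 \le m \le 1 + \log p$,

$$\left( E \max_{1 \le j \le p} |P_n\gamma_j^c|^m \right)^{1/m} \le \sqrt{\frac{2\log(2p)}{n}} + K\,\frac{2\log(2p)}{n} . \quad (4)$$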



In what follows, we will make repeated use of Bernstein's inequality. Hence,
the term $2\log(2p)/n$ will appear frequently. From now on, we denote this term
by

$$\Delta := \frac{2\log(2p)}{n} .$$
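
To get a feeling for the size of this term, the following sketch compares a Monte Carlo estimate of $E \max_j |P_n\gamma_j^c|$ with $\sqrt{\Delta} + K\Delta$, assuming the moment inequality of Lemma 2.1 takes this form for $m = 1$. The Rademacher variables, the value $K = 1/3$ and the helper name `bernstein_moment_bound` are assumptions of the example, not part of the paper.

```python
import numpy as np

def bernstein_moment_bound(n, p, K):
    """sqrt(Delta) + K * Delta with Delta = 2 log(2p) / n."""
    delta = 2.0 * np.log(2 * p) / n
    return np.sqrt(delta) + K * delta

rng = np.random.default_rng(0)
n, p, reps = 200, 50, 400
# Rademacher variables: centered, unit variance, |x| = 1, so the moment
# condition P|x|^m <= (m!/2) (2K)^(m-2) holds with K = 1/3 (assumed example).
x = rng.choice(np.array([-1.0, 1.0]), size=(reps, p, n))
max_abs_avg = np.abs(x.mean(axis=2)).max(axis=1)   # max_j |P_n gamma_j| per replication
estimate = max_abs_avg.mean()                      # Monte Carlo estimate of the first moment
bound = bernstein_moment_bound(n, p, K=1.0 / 3.0)
print(estimate, bound)
```

In this setting the simulated value stays below the bound, with the $\sqrt{\Delta}$ term dominating because $K\Delta$ is of smaller order.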



imsart-ejs ver. 2008/01/24 file: ejs_2008_254.tex date: June 25, 2008 



C. Mitchell and S. van de Geer/ Optimal oracle inequalities for model selection 5 

Remark: The moment inequality is for moments of order $m \le 1 + \log p$. It can
be extended to hold for general $m$, provided a slight adjustment, depending on
$m$, is made on the constants. Because we have the situation in mind where $p$ is
large, we have formulated the result for $m \le 1 + \log p$ to facilitate the exposition.

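Corollary 2.1. (Weighted version of Bernstein's inequality) Assume that for
some constant $K$, the condition

$$P|\gamma_j^c - \gamma_*^c|^m \le \frac{m!}{2}(2K)^{m-2} d^2(f_j, f_*) , \quad m = 2, 3, \ldots , \ \forall j \quad (5)$$

holds. Then for all $t > 0$ and $\tau > 0$,

$$P\left( \max_{1 \le j \le p} \frac{|P_n(\gamma_j^c - \gamma_*^c)|}{d(f_j, f_*) \vee \tau} \ge \sqrt{\Delta + \frac{2t}{n}} + \frac{K(\Delta + 2t/n)}{\tau} \right) \le \exp[-t] .$$

Moreover, for all $1 \le m \le 1 + \log p$,

$$\left( E \left( \max_{1 \le j \le p} \frac{|P_n(\gamma_j^c - \gamma_*^c)|}{d(f_j, f_*) \vee \tau} \right)^m \right)^{1/m} \le \sqrt{\Delta} + \frac{K\Delta}{\tau} .$$

Define, for all $\gamma$, the variance

$$\sigma^2(\gamma) := P|\gamma^c|^2 .$$

Then (5) clearly implies that

$$d^2(f_j, f_*) \ge \sigma^2(\gamma_j - \gamma_*) , \quad \forall j .$$

Moreover, if the bound $|\gamma_j - \gamma_*| \le 3K$ holds $\forall j$, then (5) holds with

$$d^2(f_j, f_*) = \sigma^2(\gamma_j - \gamma_*) \quad \forall j .$$

In what follows, we will indeed often assume (5) with this value for $d(f_j, f_*)$,
but we will also consider an extension. The choice of the metric $d$ is intertwined
with the margin behavior, which we consider in the next section.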
3. Margin behavior

Definition 3.1. We say that the margin condition holds with strictly convex
margin function $G(\cdot)$, if

$$P(\gamma_j - \gamma_0) \ge G(d(f_j, f_0)) , \quad \forall j . \quad (6)$$

Furthermore, we say that the margin condition holds with constants $\kappa \ge 1/2$
and $C > 0$, if (6) holds with

$$G(u) = u^{2\kappa}/C^{2\kappa} , \quad u > 0 .$$

As we shall see, $\kappa = 1$ in typical cases - but other, in particular larger, values
can also occur.

Let us now consider some examples. In a regression or classification situation,
we have i.i.d. random pairs $Z_i = (X_i, Y_i)$, with $Y_i \in \mathcal{Y} \subset \mathbf{R}$ a response variable,
and $X_i \in \mathcal{X}$ a covariable, $i = 1, \ldots, n$. We then assume (for $i = 1, \ldots, n$) that
the conditional distribution of $Y_i$, given $X_i = x$, only depends on $x$ and not on
$i$. This can be done without loss of generality (as the index $i$ can be taken in as
an additional covariable).

Example 3.1. (Regression) Suppose that $\{Z_i\}_{i=1}^n := \{(X_i, Y_i)\}_{i=1}^n$. Let $F$ be
a class of real-valued functions on $\mathcal{X}$, and for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, let

$$\gamma_f(x, y) := \gamma(f(x), y) , \quad f \in F .$$

Set

$$l(a, \cdot) = E(\gamma(a, Y_i) \mid X_i = \cdot) , \quad a \in \mathbf{R} .$$

We moreover write $l_f(x) := l(f(x), x)$. As target we take the overall minimizer

$$f_0(\cdot) := \arg\min_{a \in \mathbf{R}} l(a, \cdot) .$$

We now check whether the margin condition holds with $\kappa = 1$ and

$$d^2(f, f_0) := K_2^2 P|f - f_0|^2 ,$$

where $K_2$ is an appropriate constant.

Lemma 3.1. Assume that for some $K_1 > 0$, and all $|f - f_0| \le K_1$,

$$l_f - l_{f_0} \ge (f - f_0)^2/C_0^2 . \quad (7)$$

Then

$$P(l_f - l_{f_0}) \ge d^2(f, f_0)/C^2 ,$$

where $C^2 := C_0^2 K_2^2$. If we moreover assume (for $i = 1, \ldots, n$) that

$$\mathrm{var}(\gamma_f(Z_i) - \gamma_{f_0}(Z_i)) \le K_2^2 E(f(X_i) - f_0(X_i))^2 , \quad (8)$$

then for all $\|f - f_0\|_\infty \le K_1$, we have

$$\sigma^2(\gamma_f - \gamma_{f_0}) \le d^2(f, f_0) .$$
If $l(a, \cdot)$ has two derivatives near $a = f_0(\cdot)$, and the second derivatives are
positive and bounded away from zero, then $l(a, \cdot)$ behaves quadratically near its
minimum, i.e., then (7) holds for some $K_1 > 0$.

It is also clear that (8) holds as soon as $\gamma(\cdot, y)$ is Lipschitz for all $y$, with
Lipschitz constant $L$. Then we may take $K_2 = L$. When $\gamma(\cdot, y)$ is not Lipschitz
(e.g., quadratic loss), it may be useful to define

$$e_f(Z_i) := \gamma(f(X_i), Y_i) - l_f(X_i) .$$

Then obviously

$$\mathrm{var}(\gamma_f(Z_i) - \gamma_{f_0}(Z_i)) = \mathrm{var}(e_f(Z_i) - e_{f_0}(Z_i)) + \mathrm{var}(l_f(X_i) - l_{f_0}(X_i)) . \quad (9)$$

Note that with fixed design, the second term in (9) vanishes.

Quadratic loss:
In the case of least squares, the loss function is

$$\gamma(f, y) := (y - f)^2 .$$

Then

$$l_f - l_{f_0} = |f - f_0|^2 ,$$

and

$$e_f(Z_i) - e_{f_0}(Z_i) = -2\varepsilon_i (f(X_i) - f_0(X_i)) ,$$

with $\varepsilon_i := Y_i - f_0(X_i)$. Assuming that the conditional variance is bounded by
some constant $\sigma_\varepsilon^2$, i.e.,

$$\max_{1 \le i \le n} \mathrm{var}(Y_i \mid X_i) \le \sigma_\varepsilon^2 , \quad (10)$$

we may conclude the following.

Least squares with fixed design:
The margin condition holds with $\kappa = 1$ and $C^2 = 4\sigma_\varepsilon^2$.

Least squares with random design:
If $\|f_j - f_0\|_\infty \le K_1$ for all $j$, the margin condition holds with $\kappa = 1$ and
$C^2 = 4\sigma_\varepsilon^2 + K_1^2$.
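
The fixed design case can be checked numerically. The sketch below assumes homoscedastic Gaussian errors with standard deviation `sigma`, a hypothetical true regression function and a hypothetical candidate; under these assumptions the variance of the excess loss, averaged over the design points, should equal $4\sigma_\varepsilon^2 P|f - f_0|^2$, which is what the constant $C^2 = 4\sigma_\varepsilon^2$ expresses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 0.7
x = np.linspace(0.0, 1.0, n)          # fixed design points (assumed)
f0 = np.sin(2 * np.pi * x)            # hypothetical true regression function
f = f0 + 0.3 * x                      # hypothetical candidate

reps = 20000
eps = sigma * rng.standard_normal((reps, n))
y = f0 + eps
excess_loss = (y - f) ** 2 - (y - f0) ** 2     # gamma_f - gamma_{f0} per observation

excess_risk = np.mean((f - f0) ** 2)           # P(gamma_f - gamma_{f0}) = P|f - f0|^2
# sigma^2(gamma_f - gamma_{f0}): average over i of the variance of the excess loss
var_mc = excess_loss.var(axis=0).mean()
var_exact = 4 * sigma**2 * excess_risk
print(var_mc, var_exact)
```

The two printed numbers should agree up to Monte Carlo error, so that $d^2 = \sigma^2(\gamma_f - \gamma_{f_0})$ satisfies the margin condition with equality in this example.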

Example 3.2. (Classification) Suppose that $Z_i = (X_i, Y_i)$, with $Y_i \in \mathcal{Y} :=
\{0, 1\}$ a label, $i = 1, \ldots, n$. Let $F$ be a class of functions $f : \mathcal{X} \to [0, 1]$. We
consider 0/1-loss

$$\gamma_f(x, y) = \gamma(f(x), y) := (1 - y) f(x) + y (1 - f(x)) , \quad f \in F , \ (x, y) \in \mathcal{X} \times \{0, 1\} .$$

For $a \in [0, 1]$, write

$$l(a, \cdot) := E(\gamma(a, Y) \mid X = \cdot) = (1 - \eta) a + \eta (1 - a) = a (1 - 2\eta) + \eta ,$$

where $\eta = E(Y \mid X = \cdot)$. The target is again the overall minimizer

$$f_0 := \arg\min_{a \in [0, 1]} l(a, \cdot) .$$

It is clear that $f_0$ is the Bayes rule

$$f_0 = 1\{1 - 2\eta < 0\} + q\, 1\{1 - 2\eta = 0\} ,$$

with $q$ an arbitrary value in $[0, 1]$. We moreover have

$$P(\gamma_f - \gamma_{f_0}) = P|(f - f_0)(1 - 2\eta)| .$$
Consider the functions

$$H_1(v) := v P 1\{|1 - 2\eta| < v\} , \quad v \in [0, 1] ,$$

and

$$G_1(u) := \max_v \{u v - H_1(v)\} , \quad u \in [0, 1]$$

(assuming the maximum exists).
Lemma 3.2. The inequality

$$P(\gamma_f - \gamma_{f_0}) \ge G(\sigma(\gamma_f - \gamma_{f_0}))$$

holds with $G(u) = G_1(u^2)$, $u \in [0, 1]$.

If $H_1(v) = 0$ for $v \le C_1$, we take $G_1(u) = C_1 u$. More generally, the Tsybakov
margin condition (see [12]) assumes that one may take, for some $C_1 \ge 1$ and
$\gamma > 0$,

$$H_1(v) = v (C_1 v)^{1/\gamma} .$$

Then one has

$$G_1(u) = u^{1+\gamma}/C^{1+\gamma}$$

where

$$C = C_1^{\frac{1}{1+\gamma}} \gamma^{-\frac{\gamma}{1+\gamma}} (1 + \gamma) .$$

Thus, then the margin condition holds with this value of $C$ and with $\kappa = 1 + \gamma$
(and for any $d$ satisfying $d(f_j, f_0) \ge \sigma(\gamma_j - \gamma_0)$, $\forall j$).
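
The closed form for $G_1$ under the Tsybakov-type choice $H_1(v) = v (C_1 v)^{1/\gamma}$ can be checked numerically. The sketch below (function names and parameter values are illustrative assumptions) maximizes $uv - H_1(v)$ over a grid of $v \in [0, 1]$ and compares the result with $u^{1+\gamma}/C^{1+\gamma}$ for $C = C_1^{1/(1+\gamma)} \gamma^{-\gamma/(1+\gamma)} (1+\gamma)$.

```python
import numpy as np

def g1_numeric(u, c1, gam, grid=200001):
    """max over v in [0, 1] of u*v - H1(v), with H1(v) = v * (c1*v)**(1/gam)."""
    v = np.linspace(0.0, 1.0, grid)
    return np.max(u * v - v * (c1 * v) ** (1.0 / gam))

def g1_closed_form(u, c1, gam):
    """u**(1+gam) / C**(1+gam) with C = c1**(1/(1+gam)) * gam**(-gam/(1+gam)) * (1+gam)."""
    big_c = c1 ** (1.0 / (1.0 + gam)) * gam ** (-gam / (1.0 + gam)) * (1.0 + gam)
    return u ** (1.0 + gam) / big_c ** (1.0 + gam)

for u in (0.1, 0.5, 1.0):
    print(u, g1_numeric(u, c1=2.0, gam=0.5), g1_closed_form(u, c1=2.0, gam=0.5))
```

For $C_1 \ge 1$ and $u \le 1$ the maximizer $v = (u\gamma/(1+\gamma))^\gamma / C_1$ lies in $[0, 1]$, so restricting the grid to this interval does not change the maximum.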

Example 3.3. (Maximum likelihood) Suppose that $\{Z_i\}_{i=1}^n$ are i.i.d. with density
$f_0 := dP/d\mu$, where $\mu$ is a $\sigma$-finite dominating measure. Let $F$ be a (convex, say)
class of densities w.r.t. $\mu$, containing $f_0$. Consider the transformed log-likelihood
loss

$$\gamma_f(\cdot) := \gamma(f(\cdot)) ,$$

where $\gamma(a) = -\log(a)/2$. Define

$$\bar f := (f + f_*)/2 , \quad f \in F .$$

The squared Hellinger distance of densities $f$ and $\tilde f$ is

$$h^2(f, \tilde f) = \frac{1}{2} \int \left(\sqrt{f} - \sqrt{\tilde f}\right)^2 d\mu , \quad f, \tilde f \in F .$$

We now check the margin condition with $\kappa = 1$ and $d(f, f_0) = C h(f, f_0)$.
Lemma 3.3. For all densities $f$, we have

$$P(\gamma_f - \gamma_{f_0}) \ge h^2(f, f_0) .$$
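
For the loss $\gamma(a) = -\log(a)/2$ the excess risk is half the Kullback-Leibler divergence, so the first claim says that half the Kullback-Leibler divergence dominates the squared Hellinger distance. A minimal numerical sketch on a finite sample space (densities with respect to counting measure, drawn at random purely for illustration; the function names are assumptions):

```python
import numpy as np

def half_kl(f0, f):
    """P(gamma_f - gamma_{f0}) for gamma(a) = -log(a)/2: half the Kullback-Leibler divergence."""
    mask = f0 > 0
    return 0.5 * np.sum(f0[mask] * np.log(f0[mask] / f[mask]))

def hellinger_sq(f, g):
    """Squared Hellinger distance h^2(f, g) = (1/2) * sum (sqrt(f) - sqrt(g))^2."""
    return 0.5 * np.sum((np.sqrt(f) - np.sqrt(g)) ** 2)

rng = np.random.default_rng(2)
for _ in range(5):
    f0 = rng.dirichlet(np.ones(6))   # hypothetical true density on 6 points
    f = rng.dirichlet(np.ones(6))    # hypothetical candidate density
    print(half_kl(f0, f), hellinger_sq(f, f0))
```

In each printed pair the first number should be at least as large as the second.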
Moreover, under the assumption 

V /* " 8 ' 

we have 

o-{lj~u,)<Ch{j,U) . 
4. Quadratic margin and exponential moments

The first case we shall look at is the one with quadratic margin condition ($\kappa = 1$)
and exponential moments on the tails of the loss functions. This encompasses e.g.
regression with sub-Gaussian errors, as well as many situations where estimators
and losses have absolute bounds.

4.1. General loss

Lemma 4.1. Suppose that the margin condition holds, with constants $\kappa = 1$
and $C > 0$. Assume moreover that the loss functions satisfy the exponential
moment condition (1) for some $K > 0$. Then for all $t > 0$, and when $\mathcal{E}_* > 0$,




When 



$$\mathcal{E}_* \le K(\Delta + 2t/n)$$



we have 




-t 



Moreover, for all $1 \le m \le 1 + \log p$, when $\mathcal{E}_* > 0$, we have




and when 



$$\mathcal{E}_* \le K\Delta$$



we have 



$$\left\| \sqrt{\hat{\mathcal{E}}} \right\|_m \le (C + 2\sqrt{K}) \sqrt{\Delta} .$$



Proof. All statements in this lemma are special cases of Lemma 6.2.



□ 



Corollary 4.1. (Asymptotics) When 



£^(K + C 2 ) A 



it holds (for m < 1 + logp) that 
4.2. Maximum likelihood

Define

$$\hat{\mathcal{K}} := P(\gamma_{(\hat f + f_*)/2} - \gamma_{f_0}) = \hat{\mathcal{E}}_{1/2}$$

and

$$\mathcal{K}_* := P(\gamma_{f_*} - \gamma_{f_0}) = \mathcal{E}_* .$$

Note that $\hat{\mathcal{K}}$ and $\mathcal{K}_*$ are Kullback-Leibler information numbers. Lemma 4.2
below presents a version of Lemma 4.1 for the maximum likelihood framework.

Lemma 4.2. Suppose that 



7o < c 



Then for all $t > 0$, and when $\mathcal{K}_* > 0$,
/C 




>l + C\ 



A + 2t/n A + 2t/n 



K„ t . 



/C* 



< e" 



When 
we have 



/C« < A + 2t/n 



> (C + 2)v/A + 2</n^ < 



Moreover, for all $1 \le m \le 1 + \log p$, when $\mathcal{K}_* > 0$, we have

~A~ A 



and when
we have 



< l + C\ 



K* < A 



< (C + 2)Va 



5. Quadratic margin, power tails

5.1. Large values of p, power tails of the envelope function

Lemma 5.1. Suppose that the margin condition holds, with constants $\kappa = 1$ and
$C > 0$, and some $d$ satisfying $d(f_j, f_0) \ge \sigma(\gamma_j - \gamma_0)$, $\forall j$. Assume moreover that
the envelope has power tails, i.e., that (2) holds for some $s > 1$ and $M \in (0, \infty)$.
Then for $1 \le m \le 1 + \log p$, and $m < 2s$, when $\mathcal{E}_* > 0$,



< l + CW— +c,. — 



71/ 



2s-r, 

A— 
where
with 

Moreover, if 

we have 
where 



m \ 2s + m I 2s — m 
2s — to / V 2to 



C(a) = a'+. + a i+« , a > 



2 °+'" (2s — m 



MA— 



2s + rn 

2s + m „ / 2s - TO 



2s + TO 



Proof. The first moment inequality is a special case of Theorem 7.1(i), and the
other statements are immediate consequences of it. □

Corollary 5.1. (Asymptotics) When 



then we have 



E I 



5.2. Lower bounds 



5.2.1. Large values of p 



Section 5.2.2 will show that (with $m = 2$) Lemma 5.1 can be improved if $p$ is
small compared to $\sqrt{n}$. In this section, we present a lower bound where $p =
\sqrt{n} + 1$ (or larger), which shows that essentially, Lemma 5.1 (with $m = 2$)
cannot be improved. For a fair comparison, the same conditions are imposed as
in Lemma 4.1: the margin condition, and the tail condition.
We consider quadratic loss

$$\gamma_f(\cdot, y) = (y - f)^2 .$$

Moreover, we let $X_1, \ldots, X_n$ be fixed and

$$Y_i = f_0(X_i) + \varepsilon_i , \quad i = 1, \ldots, n ,$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. copies of a random variable $\varepsilon$, which has a double
Pareto distribution, with parameter $s > 2$, i.e., the distribution of $\varepsilon$ is symmetric
around 0, and

$$P(|\varepsilon| \le u) = 1 - \frac{1}{(1 + u)^s} , \quad u > 0 .$$
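
A double Pareto error of this kind is easy to simulate by inverting the distribution function. The sketch below assumes the tail $P(|\varepsilon| > u) = (1 + u)^{-s}$ (the exponent $s$ is an assumption read off from the variance constant $8/((s-2)(s-1)) = 4\,\mathrm{var}(\varepsilon)$ in Lemma 5.2) and compares the sample variance with $2/((s-1)(s-2))$; the function name `sample_double_pareto` is hypothetical.

```python
import numpy as np

def sample_double_pareto(s, size, rng):
    """Symmetric errors with P(|e| <= u) = 1 - (1 + u)**(-s) (assumed tail exponent s)."""
    u = rng.random(size)
    magnitude = (1.0 - u) ** (-1.0 / s) - 1.0      # inverse-CDF transform
    sign = rng.choice(np.array([-1.0, 1.0]), size=size)
    return sign * magnitude

rng = np.random.default_rng(3)
s = 6.0
e = sample_double_pareto(s, 400_000, rng)
print(e.mean(), e.var(), 2.0 / ((s - 1.0) * (s - 2.0)))
```

The value $s = 6$ is chosen only so that the fourth moment is finite and the sample variance is stable; for $s$ close to 2 the variance exists but its empirical estimate converges slowly.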
Now, suppose $p = \sqrt{n} + 1$, $f_p := f_0 = 0$, and that for $j = 1, \ldots, p - 1 = \sqrt{n}$,
fj(x) = \{x = X 3 }n^ , x £ X . 

Lemma 5.2. The margin condition holds with $\kappa = 1$ and $C^2 = 8/((s - 2)(s - 1))$,
and the power tail condition (2) holds with $M = 2$. Moreover, for $n \ge 2^{2s}$, with
probability at least $1 - \exp[-2^{-s}]$ we have

£ > n'^ 1 . 

Remark One may easily extend the situation to $p \gg \sqrt{n}$, because one may
add, as candidates, as many bounded functions $f_j$, say $\|f_j\|_\infty \le 1$, without
destroying the moment condition (increasing $M$ from $M = 2$ to $M = 4$). These
added functions may be selected by the least squares estimator, but if they
all have norm P/? > n ~, selecting one of those still gives the same lower 
bound. 

5.2.2. Small values of p: the least squares case 

We consider again quadratic loss, and

$$Y_i = f_0(X_i) + \varepsilon_i , \quad i = 1, \ldots, n ,$$

with fixed design $X_1, \ldots, X_n$, where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. copies of a random variable
$\varepsilon$ with mean zero. Assume now a finite $s$-th moment

$$M^s := E|\varepsilon|^s .$$

We now show that a lower bound of order n ~ for E£ will not hold if p is 
small compared to $\sqrt{n}$.

Lemma 5.3. We have 

(f, (v 7 ^)^ S < Cc s p 1/s M/Vn~+ yfiZ , 

where 

Corollary 5.2. If p < -Jn it holds that 

E£ < (Cc s n-^M + y/rX . 
6. General margin, exponential moments

In this section, we weaken the margin condition to allow for parameter values
$\kappa > 1$. Example 3.2 already showed us the necessity of this more general
condition, as it overlaps with Tsybakov's margin condition.

Lemma 6.1. Suppose that the margin condition holds, with strictly convex margin
function $G$. Let $H$ be the convex conjugate of $G$. Assume that for some
$r \le 1 + \log p$, the function $H(v^{1/r})$, $v > 0$, is concave. Assume moreover that the
exponential moment condition (1) holds for some $K > 0$. Then for all $0 < \delta < 1$,
and $\varepsilon > 0$, we have



+ (1 + 5)£, 



Lemma 6.1 is already set in the form of a non-sharp oracle inequality, rather
than as a general bound on risk moments. In Section 7 we will derive a similar
oracle inequality for the margin condition with $G(u) = u^{2\kappa}/C^{2\kappa}$, but first we
give the more general risk bound in this case:

Lemma 6.2. Suppose that the margin condition holds, with constants $\kappa \ge 1$
and $C > 0$. Assume moreover that the exponential moment condition (1) holds
for $K > 0$. Then for $\mathcal{E}_* > 0$, and all $t > 0$, we have



P I f— f z 

where 

Moreover, for 



A(k) yC v / A + 2t/n 



K(A + 2t/n) 



< e" 



A(k) 



1 + (2k- 1)3s=t 
(2k) 

< K{A + 2t/n) , 



< 2 . 



we have 

P > A(k) (Cy/A + 2t/nj 2 "~ 1 + 2 (K(A + 2t/n)Y 

Furthermore, for all m < (1 + log p) (2k — 1), when £* > 0, 

KA\ ^ 



< e" 



(*)' 



and when 
we have 



< £^ + A(k) I CVA 
£*<KA , 

<a( k ) (cVa)~ -J 



2(KA)^ 
Proof. The moment inequalities follow from Theorem 7.1(ii), first taking $\tau^2 :=
\mathcal{E}_*$, and then $\tau^2 := K\Delta$. The tail bounds follow from the same theorem by
taking first $\tau^2 := \mathcal{E}_*$, then $\tau^2 := K(\Delta + 2t/n)$. □

Corollary 6.1 (Asymptotics). When 

E* > CA^i +KA , 
it holds (for all m < (1 + log p) (2k — 1)) that 




7. General margin & tails 



We now formulate our main theorem, whose proof also contains the proof of the
moment bounds in Lemma 6.2.

Theorem 7.1. (i) Suppose that the margin condition holds for the loss functions
$\gamma_j$ with constants $\kappa \ge 1$ and $C > 0$ and some $d$ satisfying $d(f_j, f_0) \ge
\sigma(\gamma_j - \gamma_0)$, $\forall j$. Also assume that the envelope $\Gamma$ has power tails in the
form of (2) for some $s > 1$ and $M > 0$. Then for all $m$ in the interval
$[2\kappa, \min(2s\kappa, 1 + \log(p))[$ and for all $\tau > 0$, we have the following
inequality:


(i) 2K <(J,Vrp+ A(k) ■ C a ■ A a ' 2 



where 



s,m) ■ M™"*+P . A^+? • (£» V r) 2 »"^ 



a := 



2k- 1 



m 2k 



A(k) := 



1 + (2k - 1) 



and 



£,(K,s,m) := A(k)^+^-2 2 «'°+3. 



2sk — m 



*+/3 



3 ■ 



(ii) Furthermore, if the excess losses satisfy the exponential moment condition
(1) for some constants $K > 0$, then



< (£, Vt) s +A(/s) • C 



(£*Vr)i 
In this case we also have tail bounds



P ^ A > £p + A(k) \CJZ+Kfc + X(A +/ /n) j *" I < e" 
for allt>0 . 

These statements lead to simpler ones if we use that $\tau \le \mathcal{E}_* \vee \tau \le \mathcal{E}_* + \tau$ and
then optimize over $\tau$:

Corollary 7.1. Under the conditions of Theorem 7.1, we have the inequality



< £?" + A{k) ■ C a ■ A a/2 + £{k, s, m) ■ M^'^+fes • A^+^ 



when the loss envelope $\Gamma$ has power tails (2), and



(it) 



when the excess losses satisfy the exponential moment condition (1).

7.1. Special cases of Corollary 7.1

We can apply Corollary 7.1 to the (more restricted) cases described in the
previous sections:

Quadratic margin, power tails: Here $\kappa = 1$ and thus $\alpha = 1$, $\beta = s/m - 1/2$
and $A(\kappa) = 1$. Theorem 7.1 thus implies

\fl < y/£^+-C- VA + f(l,«,m)-AfA • A^ . (f^sjcqra ) 



as in Lemma 5.1; the corresponding simplified version from Corollary 7.1 is
VI < a/^ + C- VA + ^(l,s,m)- VM-A 1 -- . 

m 

For m = 2, this implies 



f£< (1 + 5)V^* + ^- (C- A + £(l,s,2)- M • A" 

In the example of least-squares regression (Example 3.1), we know that a
quadratic margin condition holds, e.g. for the fixed design with $C^2 := 4\sigma_\varepsilon^2$. If
furthermore we assume that the errors possess some finite moment of order
$2s > 2$ - a less restrictive assumption than the Gaussianity often assumed -
then the loss has power tails of order $s > 1$:

$$\gamma_f(x, y) = \gamma(f(x), y) = (y - f(x))^2 = (\varepsilon + f_0(x) - f(x))^2 ,$$

$$E[\Gamma^s] \le 2^s E\Big[\sup_{f \in F} \gamma_f(X, Y)^s\Big]
\le 2^{3s-1} E\Big[|\varepsilon|^{2s} + \sup_{f \in F} |f_0(X) - f(X)|^{2s}\Big]
= 2^{3s-1} \Big( E|\varepsilon|^{2s} + E \sup_{f \in F} |f_0(X) - f(X)|^{2s} \Big) =: M^s ,$$

and so by Chebyshev,

$$P(\Gamma > K) \le \left(\frac{M}{K}\right)^s \quad \forall K > 0 .$$



General margin, exponential tails: The risk bound in this case was given
in Part (ii) of Corollary 7.1, whose correction term is of order $O(\Delta^{1/(4\kappa - 2)})$.
This leads to an oracle inequality of the form



P£ < (1 + £)£* + ^o(a^) V<5> 







In Example 3.2 we have already seen the margin condition for
$C = C_1^{\frac{1}{1+\gamma}} \gamma^{-\frac{\gamma}{1+\gamma}} (1 + \gamma)$ and $\kappa = 1 + \gamma$, where $\gamma > 0$, as a consequence of
Tsybakov's margin condition. Furthermore,

$$P|\gamma_f^c - \gamma_{f_0}^c|^m = P\big|(f(X) - f_0(X))(1 - 2Y) - P(f - f_0)(1 - 2\eta)\big|^m
\le 2^{m-2} P|\gamma_f^c - \gamma_{f_0}^c|^2 = 2^{m-2} \sigma^2(\gamma_f - \gamma_{f_0})$$

for all $f$ in this example, which means that the excess losses have exponential
moments (1) with $K = 1$. Thus we have an oracle inequality

£ < (l + <5)£* + i (li(Ci,7)- +I 2 - A 
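8. Proofs

8.1. Proofs for Section 2

Proof of Lemma 2.1. Without loss of generality, suppose that $E\gamma_j(Z_i) = 0$
for all $i$ and $j$. Bernstein's probability inequality says that for all $t > 0$,

$$P\left( \frac{1}{n} \sum_{i=1}^n \gamma_j(Z_i) \ge 2Kt + \sqrt{2t} \right) \le \exp[-nt] , \quad \forall j . \quad (11)$$

This inequality follows from the intermediate result

$$E \exp\Big[\sum_{i=1}^n \gamma_j(Z_i)/L\Big] \le \exp\left[ \frac{n}{2(L^2 - 2LK)} \right] , \quad \forall j , \quad (12)$$

which holds for all $L > 2K$. Inequality (3) follows immediately from (11).
To prove (4) we apply Lemma 8.1. We then obtain for all $L > 0$, and all $m$

$$E \max_{1 \le j \le p} \Big| \sum_{i=1}^n \gamma_j(Z_i) \Big|^m \le L^m \log^m \left( E \exp\Big[\max_{1 \le j \le p} \Big| \sum_{i=1}^n \gamma_j(Z_i) \Big| / L\Big] - 1 + e^{m-1} \right) .$$

From (12), and invoking $e^{|x|} \le e^x + e^{-x}$, we obtain for $L > 2K$,

$$L^m \log^m \left( E \exp\Big[\max_{1 \le j \le p} \Big| \sum_{i=1}^n \gamma_j(Z_i) \Big| / L\Big] - 1 + e^{m-1} \right)
\le L^m \log^m \left( p \Big\{ 2 \exp\Big[ \frac{n}{2(L^2 - 2LK)} \Big] - 1 \Big\} + e^{m-1} \right)$$

$$\le L^m \log^m \left( (2p + e^{m-1} - p) \exp\Big[ \frac{n}{2(L^2 - 2LK)} \Big] \right)
= \left( L \log(2p + e^{m-1} - p) + \frac{n}{2(L - 2K)} \right)^m .$$

Now take

$$L = 2K + \sqrt{\frac{n}{2 \log(2p + e^{m-1} - p)}} .$$

□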



Lemma 8.1. (Jensen's inequality for partly concave functions) Let $X$ be a
real-valued random variable, and let $g$ be an increasing function on $[0, \infty)$, which is
concave on $[c, \infty)$ for some $c \ge 0$. Then

$$E g(|X|) \le g\big( E|X| + c P(|X| \le c) \big) .$$

Proof. We have

$$E g(|X|) = E g(|X|) 1\{|X| > c\} + E g(|X|) 1\{|X| \le c\}
\le E g(|X|) 1\{|X| > c\} + g(c) P(|X| \le c)$$

$$= E\big[ g(|X|) \,\big|\, |X| > c \big] P(|X| > c) + g(c) P(|X| \le c) .$$

We now apply Jensen's inequality to the term on the left, and then use the
concavity on $[c, \infty)$ to incorporate the term on the right:

$$E g(|X|) \le g\big( E(|X| \,\big|\, |X| > c) \big) P(|X| > c) + g(c) P(|X| \le c)
\le g\big( E|X| + c P(|X| \le c) \big) .$$

□
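
A quick numerical illustration of this Jensen-type bound, under assumed choices: $g(x) = \log^m(1 + x)$, which is increasing on $[0, \infty)$ and concave on $[c, \infty)$ with $c = e^{m-1} - 1$, and a hypothetical sample standing in for $|X|$. Since the bound holds for any distribution, it also holds for the empirical distribution of the sample.

```python
import numpy as np

m = 3
c = np.exp(m - 1) - 1.0                      # log(1 + x)**m is concave for x >= c
g = lambda x: np.log1p(x) ** m               # increasing on [0, inf)

rng = np.random.default_rng(4)
x = np.abs(rng.standard_normal(200_000)) * 5.0   # hypothetical sample of |X|
lhs = g(x).mean()                                # E g(|X|)
rhs = g(x.mean() + c * np.mean(x <= c))          # g(E|X| + c P(|X| <= c))
print(lhs, rhs)
```

The proof of Lemma 2.1 uses the same kind of function, a power of a logarithm, which is why the constant $e^{m-1}$ appears there.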



8.2. Proofs for Section 3

Proof of Lemma 3.1. This follows from



□ 



Proof of Lemma 3.2. We have

$$P|(f - f_0)(1 - 2\eta)| \ge v P|f - f_0| 1\{|1 - 2\eta| \ge v\}
\ge v \big( P|f - f_0| - P 1\{|1 - 2\eta| < v\} \big) := u v - H_1(v) ,$$

with $u = P|f - f_0|$. Since this is true for all $v$, we may maximize over $v$ to obtain

$$P|(f - f_0)(1 - 2\eta)| \ge G_1\big( P|f - f_0| \big) \ge G_1\big( P(f - f_0)^2 \big) ,$$

as

$$P|f - f_0| \ge P(f - f_0)^2 .$$

Moreover,

$$|\gamma_f(y) - \gamma_{f_0}(y)| = |(f - f_0)(1 - 2y)| \le |f - f_0| ,$$

so that

$$\sigma^2(\gamma_f - \gamma_{f_0}) \le P(\gamma_f - \gamma_{f_0})^2 \le P(f - f_0)^2 .$$

□
Proof of Lemma 3.3. Clearly



p( as -7/0) 



/ / 

log J-rfodfi 

fo>Q V ->° 



> 



Jf o >0 V jo 



ffhd» = h 2 (f,f ) 
Moreover, 

«^(7/-7/J < -P(7/ -7/J 2 
Lemma 7.2 in [13] says that



2{exp | 7/ - 7/J - 17/ " 7/; I - 1} < 8(4 / -f - - l) 2 

V J* 



We moreover have 



Thus 



17/ - 7/, I 2 < 2{exp | 7/ - 7/J - 17/ - 7/, I - 1} 



<7 2 (7/ - 7/J < 8 / (V?- v/7*) 2 7% ^ C 2 h 2 (lU) 



8.3. Proofs for Section 6

Proof of Lemma 6.1. Define



Z 



|(P»-P)(7-7*)I 



Then 



It follows that 



£ < ZG" 1 (£) + ZG- 1 (£* Ve)+£» 



< <5£ + ( - ] + (1 + 



(1 - 5)B£ < 25~EH (-)+(! + 



< 2SH E 



r\ 1/r 



< 2<5i? 



ifA 



<5 2 8G- l (£.Ve) 



(1 + *)£•* 
) + (1 + (5)5* 



(13) 



□ 



□ 
8.4. Proofs for Section 7

8.4.1. Preparatory lemmas

We begin with two simple results (without proofs) for ease of reference.

Lemma 8.2. If the loss envelope $\Gamma$ has power tails (2), then for all $m < 2s$ and

K > 0, 

ppm/2 1 r r > K x < m M s K -(2s-m)/2 _ 
2s — TO 

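Lemma 8.3. For positive constants $a$, $b$, $\alpha$ and $\beta$, the function

$$g(x) := a x^\alpha + b x^{-\beta} , \quad x > 0$$

is minimized at

$$x_0 := \left( \frac{b\beta}{a\alpha} \right)^{\frac{1}{\alpha + \beta}}$$

and there attains a minimum of

$$g(x_0) = C(\alpha, \beta)\, a^{\frac{\beta}{\alpha + \beta}} b^{\frac{\alpha}{\alpha + \beta}}$$

where

$$C(\alpha, \beta) := \left( \frac{\beta}{\alpha} \right)^{\frac{\alpha}{\alpha + \beta}} + \left( \frac{\alpha}{\beta} \right)^{\frac{\beta}{\alpha + \beta}} .$$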
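
The minimization of $g(x) = a x^\alpha + b x^{-\beta}$ is elementary calculus, and the following sketch checks it on a grid. The formulas in the docstrings follow from setting $g'(x) = 0$; the constant written here as `c` is the factor that this calculation forces, and the numerical values of $a$, $b$, $\alpha$, $\beta$ are arbitrary assumptions.

```python
import numpy as np

def minimizer(a, b, alpha, beta):
    """x0 = (b*beta / (a*alpha))**(1/(alpha+beta)), the stationary point of g."""
    return (b * beta / (a * alpha)) ** (1.0 / (alpha + beta))

def minimum_value(a, b, alpha, beta):
    """C(alpha, beta) * a**(beta/(alpha+beta)) * b**(alpha/(alpha+beta))."""
    s = alpha + beta
    c = (beta / alpha) ** (alpha / s) + (alpha / beta) ** (beta / s)
    return c * a ** (beta / s) * b ** (alpha / s)

a, b, alpha, beta = 2.0, 3.0, 0.5, 1.5       # arbitrary positive constants
g = lambda x: a * x**alpha + b * x ** (-beta)
grid = np.linspace(0.01, 20.0, 200_001)
print(minimizer(a, b, alpha, beta), grid[np.argmin(g(grid))])
print(minimum_value(a, b, alpha, beta), g(grid).min())
```

In the proof of Theorem 7.1 this kind of trade-off is used to optimize over the truncation level $K$.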

Next we need an auxiliary lemma: 
Lemma 8.4. For all $0 \le z \le 1$, we have that

$$(1 - z)^{2\kappa} \le 1 - 2\kappa z^{2\kappa - 1} + (2\kappa - 1) z^{2\kappa}$$

and for all $z \ge 0$,

$$(1 + z)^{2\kappa} \ge 1 + 2\kappa z^{2\kappa - 1} + z^{2\kappa} .$$

Proof. The second part is clear, as it involves the omission only of positive
summands from the LHS to the RHS. For the first part, we write

$$f(z) := 1 - 2\kappa z^{2\kappa - 1} + (2\kappa - 1) z^{2\kappa} - (1 - z)^{2\kappa}$$

and note that

$$f(z) = 1 - z^{2\kappa} - (1 - z) \cdot 2\kappa z^{2\kappa - 1} - (1 - z)^{2\kappa}
= (1 - z) \left( \sum_{j=0}^{2\kappa - 1} z^j - 2\kappa z^{2\kappa - 1} - (1 - z)^{2\kappa - 1} \right)$$

$$= (1 - z)^2 \left( \sum_{j=0}^{2\kappa - 2} (j + 1) z^j - (1 - z)^{2\kappa - 2} \right)
=: (1 - z)^2 \tilde f(z) .$$

Now as $\tilde f(0) = 0$ and for $0 \le z \le 1$,

$$\tilde f'(z) = \sum_{j=1}^{2\kappa - 2} (j + 1) j z^{j - 1} + (2\kappa - 2) \cdot (1 - z)^{2\kappa - 3} \ge 0 ,$$

we know that $\tilde f(z)$, and thus $f(z)$, is non-negative on $[0, 1]$. □
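
Both inequalities of this auxiliary lemma can be sanity-checked on a grid. The sketch below writes `k2` for $2\kappa$ and restricts to a few integer values, which is an assumption of the example (the expansion argument in the proof is in terms of finite sums); the function names are hypothetical.

```python
import numpy as np

def first_gap(z, k2):
    """RHS minus LHS of (1 - z)**k2 <= 1 - k2*z**(k2-1) + (k2-1)*z**k2, with k2 = 2*kappa."""
    return 1 - k2 * z ** (k2 - 1) + (k2 - 1) * z**k2 - (1 - z) ** k2

def second_gap(z, k2):
    """LHS minus RHS of (1 + z)**k2 >= 1 + k2*z**(k2-1) + z**k2."""
    return (1 + z) ** k2 - 1 - k2 * z ** (k2 - 1) - z**k2

z01 = np.linspace(0.0, 1.0, 1001)
zpos = np.linspace(0.0, 10.0, 1001)
for k2 in (2, 3, 4, 6):
    print(k2, first_gap(z01, k2).min(), second_gap(zpos, k2).min())
```

For `k2 = 2` both gaps vanish identically (up to rounding), which shows that the inequalities are tight in the quadratic margin case.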
Lemma 8.5. Let $a$, $b$ and $c$ be positive, let $\kappa \ge 1$, and assume that

a <b + c • (a^ + b^^j . 

Then 

«*<(l + (2.-l)-H£)^ +6 *. 

Proof. First note that if a 1 / 2K < (c/2k) 1 ^ 2k , then the desired inequality 
automatically holds. Thus we can restrict ourselves to the case where o}I 2k > 
(c/2k) 1/(2k " 1) . Applying the first part of Lemma[83]for z = (c/2k) 1/(2 " _1) /a 1 / 2 ' 1 
- which now is less than 1 - gives us the inequality 

(°* - (£) " - (2^1) (£) ^ - • 6* . 

and thus 



2re — 1 



, x ,1 (2k - 1 

< 6+ 2k -1 + c 

\ 2k 

where in the second step we used that k > 1. Now part 2 of Lemma [8^1 applied 
to z = • c) 172 ^ 1 /6 1 /2« > yields 

2k- 1 \~\ , , 1 /2k -1 \~ 

— -cj j > 6+ (2«-l).^ + (— -c. 

1 \ 2k 



,2k, 

from which the stated inequality follows. □ 
8.4.2. Main proof

Proof of Theorem 7.1. (i) In the power tail case, we define

SI :=£*Vt, 
where $\tau$ is a strictly positive number, and

z |P„((7 c -7* c )i{r<if}) c | 

Then we have 

£ < |(P„-P)(7-7*)I+^* 
= |P„(7 c -7* c )l+£* 

< \P n ((7 C - 7* c ) i {r < K}f\ + \p ((7 C - 7 * c )i {r < K})\ 
+ \p n ((r-i c JHr>K})\+£» 

< cz (i& + +£* + (p n + p) (n {r > k}) 



< CZ[£- + {£l + (P n + P){Tl{T>K}))^ 

+£l + (p n + p) (n {r > k}) . 

Using Lemma 8.5 we obtain the inequality



2k ) 

(£: + (P n + P) (T1{T>K})Y 



< l + (2«-l) 



2 k 



+ (£:)- + ((p n + p) (n {r > k}))* , 

where for the second step we used the elementary observation $a^{2\kappa} + b^{2\kappa} \le
(a + b)^{2\kappa}$ for $a, b \ge 0$, $\kappa \ge 1/2$. Now we will first compute the moments
of $Z$ by an application of Bernstein's inequality. We know that



^|(7|-7* c )i{r< K}\ m < K m - 2 P (( 7j c -7* c )i{r<^}) 



and 



((7--7* c )i{r<^}) ; 



p 


Xis 


-it) 2 


i{r < 


K} 


p 


\iS 


-itf_ 






° 2 (7, - 


- 1*) 








- 7o ) - 


- (7* - 


7o)) 




(7i- 


-7o) + 


(7 (7* - 


-7o)) 2 



which by the margin condition 



< [C-(P ( 7j - lo)) 1/2K + C • (P (7. - 7o)) 1/2K 



c- 



(<5 



/2k, pi /2k 
Thus for all $j$,

(t| - 7* c ) i {r < k} 



p 



c 



m—1 



< 



< 



K 



P 



((7j-7* c )i{r<*Q) c 



c [e) I2k + {£;) 1/2K ) 



< 4- 



c (e) I2k + {s:) 1/2K 

\ m — 2 

C{Elf 2K ) 

2R \ 



and we can apply Corollary 2.1 to obtain



IZIL = 



P n m c - 7* c ) i {r < K}} 



2 VE- 



RA 



Now to compute the moments of 

(p„ + p) (n {r > r})^ , 

we proceed as follows for $m \ge 2\kappa$ (using that $\kappa \ge 1/2$)

((p„ + p) (ri{r> r})) 1/2k 

(E [((Pn+P) (Tl{T>R})) r 

(p n + p) (r m/2K i{v > r} 

1/m 



cm 



t U/2k 



i/2r 



1/r 



< ^2" l/2K_1 E 

= 2 1/2K (p(r m/2K i{r>if} 

By Lemma 8.2, for $m < 2s\kappa$, this has an upper bound in



2l/2fi 



2SK — TO 



1/r, 



/m j£1/2k— s/m 



Thus we find that for $m \in [2\kappa, \min\{1 + \log(p), 2s\kappa\})$ (and remembering
that t 2 = £), 



(f) 2K +A(/s)-C , S^.(VA + 



where 



+S(k, s, to) • m s /" 1 X 1 / 2k - s /" 1 , 
\2sk - m J
If we now apply the straightforward bound 

and minimize the upper bound over $K > 0$ (using Lemma 8.3), we obtain
the desired oracle inequality for the power tail case.
(ii) If we assume the exponential moment condition instead of power tails, we
can take
Z; _ jP n ((y- 7 c))| 

and we obtain the same bound for $\|Z\|_m$ as before, but no term stemming
from $\Gamma 1\{\Gamma > K\}$. This yields the desired risk moment inequality. The
corresponding risk tail bound also comes straight from applying Bernstein's
inequality to $Z$.

□ 